# Data Manipulation and Linear Algebra
import pandas as pd
import numpy as np
# Machine Learning and Data Preprocessing
from sklearn.decomposition import PCA, TruncatedSVD, NMF
from sklearn.preprocessing import StandardScaler
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Web Interaction and Display
from IPython.display import Image, display, HTML
# Warning Handling
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
# Additional JavaScript for toggling code display in Jupyter Notebooks
HTML(
"""
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>
"""
)
ABSTRACT
This study presents a comprehensive analysis of Spotify's current market situation and the strategic approach to address its challenges. Spotify, a major player in the music streaming industry with a vast library of over 100 million songs, has been experiencing a decline in market share since 2020, partly due to controversies and competition. This analysis aims to counter this trend by enhancing user engagement and satisfaction through more accurate and diverse music recommendations, focusing on discovering emerging artists and tracks using advanced clustering and recommendation techniques.
The methodology involved using Spotify's Web API and web scraping to collect data from 1,260 unique tracks across 126 genres, along with a web application the team developed to gather users' top tracks. Initial attempts to cluster the tracks with Non-negative Matrix Factorization (NMF) were unsuccessful, so the team adopted Principal Component Analysis (PCA) for dimensionality reduction; a permutation test determined that three principal components were optimal, and the PCA results yielded eight song archetypes.
The results revealed significant coefficients in these principal components related to music features such as energy, loudness, danceability, and valence, allowing songs to be classified into different archetypes based on their characteristics. These archetypes were then applied to users' top tracks to categorize their music preferences, aiming to enhance personalization in music recommendations. The study's approach and use of advanced data analysis techniques offer a potential answer to Spotify's market share challenges, providing a more personalized and dynamic user experience.
To enhance the study's impact in the music streaming industry, key improvements include integrating a user-friendly web application with QR code technology, ensuring stringent data privacy measures, and expanding the PCA model's dataset through partnerships for access to a wider range of Spotify tracks. Additionally, introducing personalized, dynamic sample playlists based on users' musical archetypes with feedback mechanisms can refine the recommendation algorithm. Transforming this tool into a viral marketing strategy by leveraging music archetype identification in social media campaigns and incorporating social features can increase user engagement and potentially revolutionize the approach of music streaming services like Spotify.
A key differentiator of this method is its applicability to new users, as it does not rely on long-term listening history but can classify music preferences based on a newly created playlist. This immediate, personalized engagement could be a valuable feature for music streaming services, enhancing user interaction and satisfaction from the outset.
INTRODUCTION
Background
Spotify, launched in 2008 in Sweden, has revolutionized the way we access and enjoy music. As one of the world's leading music streaming platforms, it offers a vast library of songs, podcasts, and videos from artists all over the globe. Spotify stands out for its user-friendly interface and personalized experience, providing features like custom playlists, radio stations, and music recommendations based on individual listening habits. Its freemium model allows users to access basic features for free with advertisements, while a premium subscription offers additional benefits like ad-free listening, offline play, and improved sound quality. With its advanced algorithms and data analytics, Spotify not only caters to the diverse tastes of millions of users but also offers a valuable platform for artists to reach a global audience, making it a key player in the digital music industry. (Wikipedia Contributors, 2019b)
As of 2021, Spotify had 456 million monthly active users worldwide. As seen in Figure 1, this number highlights Spotify's position as a leading player in the music streaming industry. Spotify's revenue for 2021 was €9.67 billion, a substantial improvement in the company's financial performance, driven perhaps in part by its diverse content: as of 2023, there were over 100 million songs available on Spotify. (Georgiev, n.d.)
Problem Statement
However, Spotify has seen its once-dominant market share gradually diminish since the fourth quarter of 2020 (see Figure 2). In 2018, Spotify held a commanding 33.2% of the market; by 2023 its share had slipped to 30.5%, still well ahead of second-place Apple at 13.7%. The reasons behind this decline are not entirely clear, but several factors could be contributing. One possibility is the controversy surrounding Spotify's decision to host contentious content, such as Joe Rogan's podcasts. Other potential reasons include criticism over unfair payouts to artists and issues with buggy features. This situation has opened the door for Spotify's competitors to gain a foothold in the audio-streaming market, each vying for a share of this dynamic industry. (Georgiev, n.d.)
Spotify appears to be facing challenges in retaining its subscriber base. In an effort to engage users, they've introduced features like Spotify Wrapped, an annual event that provides users with a personalized summary of their listening habits over the year. Despite such initiatives, it's clear that Spotify is still experiencing a decrease in its market share. This suggests that these efforts, while creative, may not be sufficiently addressing the underlying issues that lead to subscriber attrition. It's possible that factors such as competition from other streaming services, changes in consumer preferences, or dissatisfaction with Spotify's features and pricing model are contributing to this decline. To reverse this trend, Spotify might need to delve deeper into understanding and addressing the specific reasons why users are choosing other options.
Objectives
As Spotify faces the challenge of a slowly declining market share, a key strategy to counter this trend lies in enhancing user engagement and satisfaction. This project aims to deepen engagement and foster loyalty by providing more accurate and diverse music recommendations. It focuses on the discovery of emerging artists and tracks, leveraging nuanced clustering and recommendation techniques that are particularly adept at uncovering less mainstream artists and tracks, which may align with a user's evolving tastes yet remain undiscovered by traditional algorithms. The adaptability of this approach is also crucial, as musical preferences change over time: by regularly analyzing a user's top tracks, the method keeps pace with those changes and retains its relevance and effectiveness. This strategy could help address Spotify's market share challenges by offering a more personalized and dynamic user experience.
Below are the objectives of this study:
- Development of Musical Archetypes: Create distinct musical archetypes using advanced machine learning techniques (NMF and PCA/SVD).
- Wide Genre Analysis: Analyze a broad range of tracks across diverse genres to ensure comprehensive archetype development.
- Feature-Based Categorization: Utilize key Spotify audio features such as danceability, acousticness, and liveness for categorizing songs into archetypes.
- Personalized Music Discovery: Provide users with insights into their song preferences based on these archetypes, enhancing personalization in music discovery.
- Exploration and Engagement: Encourage users to explore new genres and artists that align with their identified archetypes, increasing engagement and satisfaction.
- Enhanced User Experience: Aim to improve the overall user experience on the platform by offering more tailored and resonant music recommendations.
DATA SOURCES AND DESCRIPTION
| Feature | Data Type | Description |
|---|---|---|
| acousticness | float | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| danceability | float | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| duration_ms | integer | The duration of the track in milliseconds. |
| energy | float | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. |
| instrumentalness | float | The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| key | integer | The key the track is in. Integers map to pitches using standard Pitch Class notation (e.g., 0 = C, 1 = C♯/D♭, 2 = D). If no key was detected, the value is -1. |
| liveness | float | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| loudness | float | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. |
| mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | float | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. |
| tempo | float | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| time_signature | integer | An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of "3/4", to "7/4". |
| valence | float | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
Table 1 shows the audio features extracted for processing (Spotify, n.d.). The Spotify API, officially known as the Spotify Web API, is a powerful tool that gives developers programmatic access to a wide range of Spotify's music data and functionality, including information about artists, albums, tracks, and playlists. Details of the data extraction process can be found in the Data Extraction notebook.
The Spotify API returns responses in JSON format, making them accessible to developers familiar with web technologies and well suited to building applications and services that interact with the rich world of music and user data on Spotify.
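To make the response shape concrete, here is a minimal sketch of parsing an audio-features-style JSON payload with the standard library; the track ID and feature values below are made up for illustration, not real API output.

```python
import json

# A made-up payload in the shape of a Web API audio-features response
# (the id and values below are illustrative, not real data)
sample_response = """
{
  "audio_features": [
    {"id": "track001", "danceability": 0.74, "energy": 0.81,
     "loudness": -5.2, "valence": 0.79, "tempo": 118.0}
  ]
}
"""

payload = json.loads(sample_response)
# Flatten each track into an (id, feature-dict) pair for downstream use
rows = [(f["id"], {k: v for k, v in f.items() if k != "id"})
        for f in payload["audio_features"]]
print(rows[0][0], sorted(rows[0][1]))
```

From here, a list of such rows can be loaded straight into a pandas DataFrame for the analysis below.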
Figure 3 shows the landing page of the web app. The application asks a user to log in with their Spotify account and authorize data access, then returns that user's top tracks. This helped the team gather user data to test their approach to assigning a user's archetypes.
METHODOLOGY
From Figure 4 above, this study utilized Spotify's Web API and web scraping techniques to gather audio features and additional data like genres and track names from 1,260 unique tracks, spread across 126 genres. To enhance user engagement, the team developed a web application that retrieves a user's top tracks from Spotify, circumventing the API's authentication limitations. This data, along with the team's tracks, was processed for analysis. The exploratory data analysis (EDA) began with a failed attempt to cluster songs using Non-negative Matrix Factorization (NMF) due to a lack of clear latent factors. Consequently, the team opted for Principal Component Analysis (PCA), including a permutation test to determine the necessary number of principal components. The PCA results led to the identification of eight song archetypes. Lastly, the study applied these archetypes to users' top tracks, including those from MSDS 2024 students, to categorize their music preferences.
Step-by-Step Process:
- Data Collection: Gather audio features, genres, and track names for 1,260 tracks using Spotify's Web API and web scraping.
- User Engagement Tool: Develop a web app to fetch user-specific top tracks from Spotify.
- Exploratory Data Analysis (EDA): Process the gathered data for analysis.
- Attempt at NMF: Try and fail to cluster songs using NMF due to indistinct latent factors.
- Principal Component Analysis (PCA): Implement PCA and a permutation test to identify the necessary number of components.
- Interpretation of PCA Results: Determine and define eight song archetypes based on PCA coefficients.
- Application of Archetypes: Use the identified archetypes to analyze music preferences based on users' top tracks.
EXPLORATORY DATA ANALYSIS
To learn more about the data, the team performed EDA, asking preliminary questions about the two datasets on hand: the user tracks and the track pool.
- In the user data, who were the most frequently occurring artists?
- In the track pool data, who were the most frequently occurring artists?
- How are the audio features in the track pool correlated?
- How are the track pool's features distributed?
# Artist Count in top tracks
df_tracks = pd.read_csv("top_tracks.csv")
artist = df_tracks["artist"].value_counts()[:5].to_frame()
plt.figure(figsize=(10, 6), dpi=250)
plt.barh(
artist.index,
artist["count"],
color="#1ccf54",
)
plt.xlabel("Count")
plt.xticks(rotation=0, ha="right")
plt.title("Artist Count")
plt.show()
Based on Figure 5 above, we asked which artist occurred most often among the tracks in the user data; this gives some sense of what kind of music the users listen to. Evidently, Taylor Swift is the most frequently occurring artist, and she has been trending lately because of her Eras world tour. Notably, even though the users surveyed are Filipinos, only two Filipino artists appear here: Munimuni and Armi Millare.
# Artist Count in the track pool
df_pool = pd.read_csv("track_pool.csv")
artist = df_pool["artist"].value_counts()[:5].to_frame()
plt.figure(figsize=(10, 6), dpi=250)
plt.barh(artist.index, artist["count"], color="#1ccf54")
plt.xlabel("Count")
plt.xticks(rotation=0, ha="right")
plt.title("Artist Count")
plt.show()
The team also examined the track pool they had collected. Figure 6 above shows the most frequently occurring artists across all 126 genres. EDM artists such as Daft Punk, Skrillex, and deadmau5 appear here, likely because they lead that genre; the artists listed most likely specialize in a single genre and do it well enough to stay in the top 10.
# Correl Plot of Audio Features
num = df_pool.select_dtypes(exclude="object").columns
num_df = df_pool[num]
correlation_matrix = num_df.corr()
plt.figure(figsize=(10, 8), dpi=250)
sns.heatmap(correlation_matrix, cmap="YlGnBu", linewidths=0.5)
plt.title("Correlation Heatmap of Audio Features")
plt.show()
The team analyzed a diverse set of tracks from 126 music genres, focusing particularly on the relationships between various musical features. The correlation plot in Figure 7 revealed that certain features tend to be linked. For instance, loudness and energy showed a positive correlation, which is logical as songs with higher energy often are louder. Similarly, valence and danceability were also positively correlated, indicating that songs with a cheerful and positive tone are more likely to be danceable.
On the other hand, acousticness demonstrated an inverse relationship with both loudness and energy. This finding aligns well with the nature of acoustic music, which typically leans towards a mellower sound, in contrast to the loud and high-energy characteristics of some other music genres.
audio_features = [
"danceability",
"energy",
"loudness",
"speechiness",
"acousticness",
"instrumentalness",
"liveness",
"valence",
"tempo",
]
df_pool = df_pool[audio_features]
# Create a single figure with 9 subplots (one for each audio feature)
plt.figure(figsize=(18, 12))
for i, feature in enumerate(audio_features, start=1):
plt.subplot(3, 3, i)
sns.boxplot(data=df_pool[feature], color="#1ccf54", width=0.5,
linewidth=1.5)
plt.title(feature.title(), fontsize=14)
plt.tight_layout()
plt.show()
The box plot visualization in Figure 8 summarizes the distribution of various musical attributes across songs from 126 genres, indicating that danceability, energy, and tempo have a relatively broad but consistent range across songs, while loudness levels vary widely with several outliers. Speechiness, acousticness, and instrumentalness have many low-scoring songs, suggesting most songs are not speech-dominated, not acoustic, and contain vocal elements. Liveness is generally low, implying studio recordings rather than live performances, and valence shows a balanced spread, indicating a mix of positive and negative musical tones. This analysis could be useful for understanding music trends and preferences across genres.
DIMENSIONALITY REDUCTION
To streamline the analysis, we applied dimensionality reduction, trimming the number of dimensions in the dataset for better interpretability and computational efficiency. Among the methods considered, Principal Component Analysis (PCA) proved the most interpretable and time-efficient; its short runtime allowed easy adjustments and iterative runs, fitting our time constraints. Once the principal components were obtained, the challenge shifted to choosing how many to keep. A permutation test provided the answer: beyond three principal components, additional dimensions merely introduced noise. Keeping three components therefore gave an effective dimensionality reduction.
The team had, however, started their analysis with Non-negative Matrix Factorization (NMF); the initial plan was to cluster songs based on their audio features, as seen below.
NMF
def clean_audio_features(consolidated_audio_features):
"""
Cleans and preprocesses audio feature data.
Parameters
----------
consolidated_audio_features : DataFrame
A pandas DataFrame containing the audio features.
Returns
-------
DataFrame
A cleaned DataFrame with normalized audio features, excluding 'key'
and 'mode' features.
"""
# Retain only the audio features without 'mode' and 'key'
unwanted_features = ["key", "mode"]
wanted_features = [
feature
for feature in consolidated_audio_features
if feature not in unwanted_features
]
cleandf_audio_features = consolidated_audio_features.reindex(
columns=wanted_features
)
# Get only the columns of audio features
cleandf_audio_features = cleandf_audio_features.iloc[:, :9]
wanted_features = list(cleandf_audio_features.columns)
# All matrix entries must be nonnegative
cleandf_audio_features = np.abs(cleandf_audio_features)
# Scale the features to avoid over/underemphasizing of features
# Use min-max scaling
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
cleandf_audio_features = min_max_scaler.fit_transform(
cleandf_audio_features)
cleandf_audio_features = pd.DataFrame(
cleandf_audio_features, columns=wanted_features
)
return cleandf_audio_features
# Track Pool for NMF
consolidated_audio_features = pd.read_csv('track_pool.csv')
df_audio_features = clean_audio_features(consolidated_audio_features)
df_audio_features.head()
| danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.748892 | 0.807101 | 0.249543 | 0.015001 | 0.044778 | 0.000041 | 0.034801 | 0.793684 | 0.362562 |
| 1 | 0.559141 | 0.275624 | 0.283591 | 0.008526 | 0.761044 | 0.003582 | 0.079663 | 0.287368 | 0.574333 |
| 2 | 0.521645 | 0.331887 | 0.291086 | 0.038420 | 0.157630 | 0.930612 | 0.050919 | 0.596842 | 0.305260 |
| 3 | 0.596637 | 0.206301 | 0.218620 | 0.008526 | 0.714859 | 0.000000 | 0.127400 | 0.295789 | 0.514339 |
| 4 | 0.676173 | 0.189222 | 0.323254 | 0.027736 | 0.949799 | 0.000007 | 0.331691 | 0.431579 | 0.483237 |
# Initiating the NMF model
nmf_model = NMF(max_iter=10000)
A = np.array(df_audio_features)
W = nmf_model.fit_transform(A)
# W.shape is (number of tracks, number of archetypes)
H = nmf_model.components_.T
# H.shape is (number of audio features, number of archetypes)
for archetype, coefficients in enumerate(H.T):
    # Each column of H holds the importance of every audio feature
    # to one archetype
    # Using the coefficients, get the index of the top five audio features
    top_feature_indices = coefficients.argsort()[::-1][:5]
    # Get the corresponding audio feature of each index
    top_feature_names = df_audio_features.columns[top_feature_indices]
    print(f"Archetype {archetype + 1} Top Features:")
    print(list(top_feature_names), end="\n\n")
# Add back the id of each track
audio_matrix = pd.concat(
[df_audio_features.iloc[:, :11], consolidated_audio_features["id"]], axis=1
)
# Assign an archetype to each user using W
# Matrix entries of W refer to how much of each archetype is found in each track
# The highest matrix entry gives the archetype that is dominant in each track
audio_matrix["Archetype"] = (
np.argmax(W, axis=1) + 1
) # Add 1 to make the archetype labels start from 1
audio_matrix
Archetype 1 Top Features:
['valence', 'instrumentalness', 'energy', 'tempo', 'liveness']

Archetype 2 Top Features:
['liveness', 'danceability', 'instrumentalness', 'tempo', 'valence']

Archetype 3 Top Features:
['tempo', 'loudness', 'valence', 'liveness', 'instrumentalness']

Archetype 4 Top Features:
['danceability', 'tempo', 'loudness', 'valence', 'liveness']

Archetype 5 Top Features:
['energy', 'tempo', 'valence', 'liveness', 'instrumentalness']

Archetype 6 Top Features:
['loudness', 'danceability', 'energy', 'instrumentalness', 'tempo']

Archetype 7 Top Features:
['acousticness', 'tempo', 'energy', 'valence', 'liveness']

Archetype 8 Top Features:
['instrumentalness', 'speechiness', 'acousticness', 'tempo', 'valence']

Archetype 9 Top Features:
['speechiness', 'liveness', 'valence', 'tempo', 'instrumentalness']
| danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | id | Archetype | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.748892 | 0.807101 | 0.249543 | 0.015001 | 0.044778 | 0.000041 | 0.034801 | 0.793684 | 0.362562 | 0obzwoGjXdORtAwYWZ9s53 | 6 |
| 1 | 0.559141 | 0.275624 | 0.283591 | 0.008526 | 0.761044 | 0.003582 | 0.079663 | 0.287368 | 0.574333 | 4oBEeMxkuW4t7a8qIupetG | 8 |
| 2 | 0.521645 | 0.331887 | 0.291086 | 0.038420 | 0.157630 | 0.930612 | 0.050919 | 0.596842 | 0.305260 | 1BMep2eRJLHOZnpL8Kd0lY | 6 |
| 3 | 0.596637 | 0.206301 | 0.218620 | 0.008526 | 0.714859 | 0.000000 | 0.127400 | 0.295789 | 0.514339 | 7noCkklUhEAoj8GJkbAuHq | 8 |
| 4 | 0.676173 | 0.189222 | 0.323254 | 0.027736 | 0.949799 | 0.000007 | 0.331691 | 0.431579 | 0.483237 | 1SZ63wN0pk18Dr4Epyhcsf | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1246 | 0.539825 | 0.172142 | 0.462217 | 0.019426 | 0.824297 | 0.000110 | 0.102762 | 0.498947 | 0.307914 | 3Fn6K2roSmzhKzp5BBJKJl | 9 |
| 1247 | 0.721623 | 0.616212 | 0.354685 | 0.033887 | 0.653614 | 0.986735 | 0.093522 | 0.662105 | 0.398669 | 2G3CRca8Si7vrCeDain6d0 | 6 |
| 1248 | 0.358027 | 0.542870 | 0.205763 | 0.012519 | 0.283132 | 0.505102 | 0.093522 | 0.283158 | 0.629315 | 6Ru1MJhTTFFnbSUqTGmrtM | 7 |
| 1249 | 0.738666 | 0.683525 | 0.297718 | 0.030218 | 0.038352 | 0.000337 | 0.014783 | 0.798947 | 0.637420 | 2yrAD69v6rP6bZhIjs7kOy | 6 |
| 1250 | 0.602318 | 0.371069 | 0.279398 | 0.015325 | 0.790160 | 0.005449 | 0.064983 | 0.722105 | 0.449578 | 2ARHr1CgtVWdRhlOIuBD61 | 6 |
1251 rows × 11 columns
Table 2 presents a snapshot of audio feature data post-scaling, showcasing how the features have been normalized. Following this, Table 3 illustrates the outcome of applying Non-negative Matrix Factorization (NMF) to this data. In this process, the music tracks are categorized into different archetypes based on their audio characteristics. To further understand the impact of this categorization, a sparsity plot is provided in Figure 9 below. This plot visualizes the distribution and density of the values in the dataset after the NMF application, highlighting the presence of non-zero versus zero values and offering insights into the data's structure post-transformation.
# Sparsity Plot of NMF for Non-Zero Elements
audio_features = audio_matrix.columns.tolist()[:9]
fig, ax = plt.subplots(dpi=250)
ax.spy(H)
ax.set_xticks(range(len(audio_features)))
ax.set_yticks(range(len(audio_features)))
ax.set_yticklabels(df_audio_features.columns);
# Reconstruction Error Plot for NMF
n_components = list(range(1, 10))
errors = []
n_components_list = []
for n_component in n_components:
nmf_ = NMF(n_component, max_iter=10000)
nmf_.fit(A)
n_components_list.append(n_component)
errors.append(nmf_.reconstruction_err_)
plt.figure(dpi=250)
plt.plot(n_components, errors, "-o", color="#1ccf54")
plt.xlabel(r"$n_{\mathrm{components}}$")
plt.ylabel("reconstruction error");
# Print each n_component and its corresponding reconstruction error
for n_components, error in zip(n_components_list, errors):
print(f"Archetypes: {n_components}")
print(f"Reconstruction error: {error:.6f}")
print("-" * 31)
Archetypes: 1
Reconstruction error: 23.087236
-------------------------------
Archetypes: 2
Reconstruction error: 18.141172
-------------------------------
Archetypes: 3
Reconstruction error: 14.174652
-------------------------------
Archetypes: 4
Reconstruction error: 11.223809
-------------------------------
Archetypes: 5
Reconstruction error: 9.082671
-------------------------------
Archetypes: 6
Reconstruction error: 7.077012
-------------------------------
Archetypes: 7
Reconstruction error: 4.829430
-------------------------------
Archetypes: 8
Reconstruction error: 3.618093
-------------------------------
Archetypes: 9
Reconstruction error: 0.259035
-------------------------------
The plot depicted in Figure 10 and the accompanying results show that the Non-negative Matrix Factorization (NMF) fits did not yield enough information to determine the optimal number of components, as the reconstruction error declines steadily without a clear elbow. Interpreting clusters from the sparsity plot also proved subjective and difficult.
Given the inconclusive outcomes of the NMF analysis, the team decided to explore an alternative dimensionality reduction technique, Principal Component Analysis (PCA).
PCA
Principal Component Analysis enabled the team to gain a deeper understanding and interpretation of these features. PCA is a method used to reduce the dimensionality of large datasets, increasing interpretability while minimizing information loss. It does this by transforming the original variables into a new set of variables, the principal components, which are orthogonal (uncorrelated), and which capture the maximum variance in the data. (Wikipedia Contributors, 2019)
Scaling before PCA is also important so that audio features with larger numeric ranges (such as tempo and loudness) do not dominate the components; StandardScaler normalizes each feature to zero mean and unit variance.
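As a toy illustration of what this scaling does (NumPy only; the numbers below are made up), z-scoring each column leaves it with mean 0 and unit standard deviation, which is what StandardScaler computes with its default settings:

```python
import numpy as np

# Toy feature matrix: 4 tracks x 2 features with very different ranges
X = np.array([[0.2, 120.0],
              [0.8, 90.0],
              [0.5, 150.0],
              [0.9, 100.0]])

# Z-score each column: subtract the column mean, divide by the
# (population) standard deviation, matching StandardScaler's default
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```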
# Using wanted features for tracks
features = [
"danceability",
"energy",
"loudness",
"speechiness",
"acousticness",
"instrumentalness",
"liveness",
"valence",
"tempo",
]
df_features = df_pool[features]
# Scaling data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_features)
# Fitting on all components
pca = PCA()
X_new = pca.fit_transform(df_scaled)
variance_explained = pca.explained_variance_ratio_
Permutation Test
The permutation test on PCA is a powerful method to test the robustness of PCA results, especially when dealing with high-dimensional data where overfitting and chance findings can be a concern. (Jr, 2022)
A permutation test is designed to estimate the underlying distribution of the population from which our observations originate. By understanding this distribution, we can then assess the rarity or commonness of our observed values in comparison to the overall population.(Berk, 2021)
Note that each permutation here shuffles every column independently, sampling without replacement within a column. This breaks the correlations between features while preserving each feature's marginal distribution.
def de_correlate_df(df):
"""
De-correlate the columns of a DataFrame by shuffling each column
independently.
Parameters
----------
df: pandas.DataFrame
The input DataFrame to de-correlate.
Returns
-------
X_aux: pandas.DataFrame
A new DataFrame with the same columns as the input DataFrame,
but with each column's values shuffled independently.
"""
X_aux = df.copy()
for col in df.columns:
X_aux[col] = df[col].sample(len(df)).values
return X_aux
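To see what this column-wise shuffling does, the small check below (NumPy only, with a fixed seed and made-up data) confirms that each column keeps exactly the same values, merely reordered, so only the relationships between columns are destroyed:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # toy data: 100 rows, 3 features

# Shuffle each column independently, as de_correlate_df does
X_shuffled = X.copy()
for j in range(X.shape[1]):
    X_shuffled[:, j] = X[rng.permutation(len(X)), j]

# Marginal distributions are preserved: sorted columns are identical
assert np.allclose(np.sort(X, axis=0), np.sort(X_shuffled, axis=0))
```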
# Define the number of tests and create a blank matrix
N_permutations = 1000
variance = np.zeros((N_permutations, len(features)))
pca_iter = PCA()
# Generate the new datasets and save the result
for i in range(N_permutations):
X_aux = de_correlate_df(pd.DataFrame(df_scaled))
pca_iter.fit(X_aux)
variance[i, :] = pca_iter.explained_variance_ratio_
p_val = np.sum(variance > variance_explained, axis=0) / N_permutations
# Plot of P-values of Permutation Test
plt.figure(figsize=(4, 2), dpi=250)
plt.plot(range(1, len(variance_explained) + 1), p_val, "o-", color="#1ccf54")
plt.ylim(-0.1, 1.1)
plt.xlabel("Principal Components")
plt.ylabel("p-value")
plt.show()
A permutation test was used to identify the optimal number of archetypes, ensuring objectivity and preserving valuable information. Results indicated that only the first 3 Principal Components are relevant, with more components adding noise.
# Plot for cumulative and single Variance Explained
plt.figure(dpi=250)
plt.plot(
range(1, len(variance_explained) + 1), variance_explained, "o-", color="#1ccf54"
)
plt.plot(
range(1, len(variance_explained) + 1),
variance_explained.cumsum(),
"o-",
color="#f465bf",
)
plt.ylim(0, 1)
plt.xlabel("PC")
plt.ylabel("variance explained")
plt.legend(["Single", "Cumulative"], loc=0);
Based on Figure 12 above, by using 3 principal components, we can explain 61.79% of the variance in the dataset of songs.
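The cutoff can also be read off numerically from the explained-variance ratios. The sketch below uses made-up ratios (chosen so the first three sum to roughly the 62% reported here), not the study's actual PCA output:

```python
import numpy as np

# Illustrative explained-variance ratios for nine components
# (made-up numbers; the first three sum to about 0.62)
ratios = np.array([0.31, 0.18, 0.13, 0.09, 0.08,
                   0.07, 0.06, 0.05, 0.03])

cumulative = np.cumsum(ratios)
# Smallest number of components whose cumulative share reaches 60%
k = int(np.searchsorted(cumulative, 0.60) + 1)
print(k, round(float(cumulative[k - 1]), 4))
```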
RESULTS AND DISCUSSION
DECIDING ARCHETYPES
Interpretation of Principal Components
# Extracting the loadings of the first three Principal Components
loadings = pca.components_[:3].T
pc = pd.DataFrame(
    loadings,
    index=features,
    columns=["PC" + str(x) for x in range(1, loadings.shape[1] + 1)],
)
# Plotting coefficients of features projected to PC1
colors = ["gray"] * len(features)
colors[4] = "#65D46E"
colors[6] = "#65D46E"
colors[7] = "#65D46E"
plt.figure(figsize=(10, 6), dpi=250)
pc["PC1"][::-1].plot(kind="barh", color=colors)
plt.title("PRINCIPAL COMPONENT 1\n\n" + "“INTENSE vs MELLOW”")
plt.xlabel("Coefficients")
plt.ylabel("Features")
plt.xlim(-0.6, 0.6)
plt.show()
In Figure 13, the team emphasized the significant coefficients of features in the first Principal Component (PC1), specifically focusing on energy, loudness, and acousticness. This observation aligns closely with insights derived from the correlation plot presented in Figure 7, indicating a consistency in findings across different analyses.
In this component, we can classify songs based on their intensity or mellowness.
# Plotting coefficients of features projected to PC2
colors = ["gray"] * len(features)
colors[1] = "#65D46E"
colors[2] = "#65D46E"
colors[8] = "#65D46E"
plt.figure(figsize=(10, 6), dpi=250)
pc["PC2"][::-1].plot(kind="barh", color=colors)
plt.title("PRINCIPAL COMPONENT 2\n\n" + "“HIGH-SPIRITED vs MELANCHOLIC”")
plt.xlabel("Coefficients")
plt.ylabel("Features")
plt.xlim(-0.6, 0.6)
plt.show()
In Figure 14, the team emphasized the significant coefficients of features in the second Principal Component (PC2), specifically focusing on danceability, valence, and liveness.
In this component, we can classify songs based on whether they seem to be high-spirited (positive) or melancholic.
# Plotting coefficients of features projected to PC3
colors = ["gray"] * len(features)
colors[2] = "#65D46E"
colors[3] = "#65D46E"
colors[5] = "#65D46E"
plt.figure(figsize=(10, 6), dpi=250)
pc["PC3"][::-1].plot(kind="barh", color=colors)
plt.title("PRINCIPAL COMPONENT 3\n\n" + "“NON-LYRICAL vs LYRICAL”")
plt.xlabel("Coefficients")
plt.ylabel("Features")
plt.xlim(-0.6, 0.6)
plt.show()
In Figure 15, the team emphasized the significant coefficients of features in the third Principal Component (PC3), specifically focusing on speechiness, liveness, and instrumentalness.
In this component, we can classify songs based on whether they seem to be non-lyrical (instrumental) or lyrical.
PREDICTION
With the three principal components interpreted, the team could now test those interpretations. They scaled and transformed the data gathered from the web app and assigned an archetype to each user, as shown below.
# Fitting on the optimal number of components
pca = PCA(3)
pca.fit(df_scaled)
coord = pca.components_.T

# Projecting the users' top tracks onto the fitted components
df_test = pd.read_csv("top_tracks.csv")
df_test_scaled = scaler.transform(df_test[features])
result = pca.transform(df_test_scaled)
df_result = pd.DataFrame(result, columns=["PC1", "PC2", "PC3"])

# Encode the sign pattern of (PC1, PC2, PC3) as a single archetype code
df_result["Archetype"] = (
    df_result["PC1"].apply(lambda x: 1 if x > 0 else 0)
    + df_result["PC2"].apply(lambda x: 3 if x > 0 else 0)
    + df_result["PC3"].apply(lambda x: 5 if x > 0 else 0)
)
mapper = {
    9: "Poignant Poet",
    4: "Melancholic Melodist",
    6: "Ambient Alchemist",
    1: "Serenity Savant",
    8: "Vocal Voyager",
    3: "Reflective Rhythmist",
    5: "Lyric Luminary",
    0: "High-Spirited Harmonist",
}
df_result["Archetype"] = df_result["Archetype"].map(mapper)
df_result["user"] = df_test["user"].to_list()

# Each user's representative archetype is the most frequent one among their tracks
most_frequent = df_result.groupby("user")["Archetype"].agg(
    lambda x: x.mode().iloc[0]
)
most_frequent.index = [f"User {i}" for i in range(1, 6)]
most_frequent
User 1    Ambient Alchemist
User 2    Ambient Alchemist
User 3    Lyric Luminary
User 4    High-Spirited Harmonist
User 5    Ambient Alchemist
Name: Archetype, dtype: object
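The 1/3/5 weights in the encoding above are what make each sign pattern of (PC1, PC2, PC3) map to a distinct code — exactly the eight keys of `mapper`. A quick standalone check:

```python
from itertools import product

# Enumerate all eight sign patterns (0 = non-positive, 1 = positive) and
# apply the same 1/3/5 weighting used for the archetype codes
codes = {1 * a + 3 * b + 5 * c for a, b, c in product([0, 1], repeat=3)}
print(sorted(codes))  # [0, 1, 3, 4, 5, 6, 8, 9] — eight distinct codes
```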
To obtain these results, each track of the user is first categorized under a specific archetype. Following that, to determine the user's overall archetype, the archetype that appears most frequently among their tracks is chosen as the representative archetype for the user.
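This majority vote can be sketched in isolation with pandas' `mode`; the users and per-track archetypes below are illustrative stand-ins, not the study's data:

```python
import pandas as pd

# Illustrative per-track archetypes for two hypothetical users
df_result = pd.DataFrame({
    "user": ["A", "A", "A", "B", "B"],
    "Archetype": ["Ambient Alchemist", "Ambient Alchemist",
                  "Lyric Luminary", "Vocal Voyager", "Vocal Voyager"],
})

# The most frequent archetype among a user's tracks represents that user
most_frequent = df_result.groupby("user")["Archetype"].agg(
    lambda x: x.mode().iloc[0]
)
print(most_frequent["A"])  # Ambient Alchemist (2 of 3 tracks)
```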
A list of all archetypes and their descriptions is in the Appendix (Figure 16). Note that the team used LLMs such as ChatGPT to help name the archetypes creatively, always supplying enough context to steer the output toward what was needed.
CONCLUSION
This study presents a novel approach that could significantly enhance user engagement for music streaming services like Spotify, going beyond existing features such as Spotify Wrapped. The key innovation lies in the use of Principal Component Analysis (PCA) to interpret and classify songs based on their audio features, assigning unique characters or archetypes to songs based on these features.
The distinctiveness of this method, compared to current market offerings, is its ability to engage even new users of the app. Unlike Spotify Wrapped, which relies on a user's historical listening data to generate year-end summaries, this new approach can provide immediate value to a user who has just signed up and created a playlist. It doesn't require a long history of user data to be effective; instead, it analyzes the audio characteristics of the songs in a user's playlist to assign them to specific archetypes. This immediate feedback loop can be particularly appealing to new users, offering an engaging and personalized experience right from the start.
For Spotify, incorporating this technique means offering a more dynamic and interactive feature that not only retains existing users but also attracts new ones. By surfacing insights and classifications based on the intrinsic qualities of the music its users enjoy, it could help counter the company's declining market share. Users can discover new dimensions of their musical tastes and explore similar tracks they might not have encountered otherwise.
In essence, this study lays the groundwork for a feature that could add a significant value proposition to music streaming services. It's a step towards a more intuitive and personalized user experience, aligning with the growing trend of leveraging data analytics and AI in enhancing customer engagement in the digital entertainment industry.
RECOMMENDATION
To further enhance the study and make it more impactful in the context of the music streaming industry, several key improvements are recommended. Firstly, the integration of the web application into the overall pipeline is crucial. This app should be designed for ease of use, incorporating QR code technology for quick user access and immediate identification of their music archetype. This integration must prioritize user data privacy and security, with transparent communication about data usage and consent.
Expanding the data collection for the PCA model is another critical step. The current dataset of 1,260 tracks should be significantly enlarged to include a more comprehensive range of tracks with audio features from Spotify. This expansion could involve forming partnerships with Spotify or other music databases for access to a larger dataset and setting up a continuous data collection mechanism to keep the dataset up-to-date with new music trends.
Another innovative feature would be the generation of personalized sample playlists based on users' identified archetypes. These playlists should be dynamic, reflecting users' current preferences and encouraging musical exploration. Feedback mechanisms on these playlists could further refine the recommendation algorithm.
Finally, transforming this project into a viral marketing tool could significantly increase its reach and impact. The unique aspect of music archetype identification can be leveraged in marketing campaigns, encouraging users to share their archetypes on social media. Collaborations with influencers and music communities, along with integrating social features like archetype comparisons with friends or public figures, could enhance user engagement and make this feature a trendsetter in the industry.
Implementing these enhancements would not only improve user experience but also potentially revolutionize user engagement, offering a novel and engaging tool that could address the challenges faced by music streaming services like Spotify.
References
Berk, M. (2021, September 21). How to use Permutation Tests. Medium. https://towardsdatascience.com/how-to-use-permutation-tests-bacc79f45749
Georgiev, D. (n.d.). 17 Mind-Blowing Spotify Statistics for 2022. Techjury. https://techjury.net/blog/spotify-statistics/
Jr, T. T. (2022, January 4). PCA 102: Should you use PCA? How many components to use? How to interpret them? Medium. https://towardsdatascience.com/pca-102-should-you-use-pca-how-many-components-to-use-how-to-interpret-them-da0c8e3b11f0
Spotify. (n.d.). Get Track's Audio Features. Spotify for Developers. Retrieved November 24, 2023, from https://online.stat.psu.edu/stat505/lesson/11/11.4
Spotify. (n.d.). Web API. Spotify for Developers. Retrieved November 24, 2023, from https://developer.spotify.com/documentation/web-api/
Wikipedia Contributors. (2019, March 26). Principal component analysis. Wikimedia Foundation. https://en.wikipedia.org/wiki/Principal_component_analysis
Wikipedia Contributors. (2019, June 13). Spotify. Wikimedia Foundation. https://en.wikipedia.org/wiki/Spotify
Appendix
Description of the Archetypes
The Serenity Savant: "The Serenity Savant" embodies users who find solace and joy in the serenity of melodies, leaning towards mellow tunes rather than upbeat rhythms. Their playlist is curated with a selection of happy and uplifting tracks, steering away from the melancholic and reflective. For "The Serenity Savant," the instrumental prowess of a song takes precedence, as they appreciate the emotive power of music without the need for lyrics.
The Poignant Poet: "The Poignant Poet" epitomizes Spotify users who prefer the subtlety of mellow tunes over energetic beats. This archetype curates a playlist that weaves a profound narrative through predominantly sad and reflective tracks. In contrast to exuberant tones, "The Poignant Poet" finds solace and resonance in the lyrical and vocal aspects of music, valuing the expressive power that accompanies profound lyrics.
The Melancholic Melodist: "The Melancholic Melodist" represents users who lean towards mellow tones, preferring a contemplative vibe over energetic beats. The curated playlist of "The Melancholic Melodist" is a collection of soul-stirring and sad tracks, reflecting a taste for music that elicits reflective emotions. In the domain of vocals, this archetype finds comfort in the instrumental, appreciating the expressive power inherent in music without the need for lyrics.
The Vocal Voyager: "The Vocal Voyager" encapsulates Spotify users with a penchant for musical journeys characterized by mellow melodies over upbeat rhythms. This archetype curates a playlist that prioritizes lyrical depth over instrumental simplicity, seeking narratives and stories within the song's vocal expressions. For "The Vocal Voyager," the emotive power of lyrics intertwines with the soothing tones of mellow compositions, creating a distinct musical realm that invites listeners to explore the richness of lyrical and melodic landscapes on Spotify.
The Reflective Rhythmist: "The Reflective Rhythmist" epitomizes users who find solace in the haunting melodies and poignant rhythms of sad songs. They are drawn to the rhythmic heartbeat of melancholic tunes, appreciating the profound connection between soul-stirring rhythms and emotive lyrics. Their Spotify library is a curated collection of sonorous compositions that serve as a companion in moments of introspection and deep reflection. The Reflective Rhythmist understands the therapeutic power of sad songs, using the rhythmic cadence to navigate and express the complex tapestry of emotions that define the human experience.
The Lyric Luminary: "The Lyric Luminary" is a Spotify archetype embodying users who possess a deep appreciation for songs with rich and meaningful lyrics or vocals. These enthusiasts curate playlists that showcase the artistry of storytelling through music, diving into a world where the power of words is as significant as the melody. The Lyric Luminaries are avid explorers of diverse genres, seeking tracks that resonate emotionally and lyrically. Their Spotify library serves as a collection of lyrical masterpieces, reflecting their commitment to the expressive and poetic side of music.
The High-spirited Harmonist: "The High-spirited Harmonist" is the epitome of exuberance on Spotify. This archetype is dedicated to the pursuit of joy and positive vibes through music. With a preference for energetic beats and lively melodies, these users curate playlists filled with upbeat tunes that bring a smile to their face. They find their musical sanctuary in instrumental compositions, reveling in the power of music to convey happiness without the need for lyrics. For them, music is the key to maintaining a high-spirited and uplifting atmosphere in every moment.
The Ambient Alchemist: "The Ambient Alchemist" is the Spotify archetype that finds enchantment in the alchemy of mellow, instrumental tunes, favoring songs that evoke calmness and introspection. With a predilection for the instrumental realm, "The Ambient Alchemist" is on a continuous quest for musical elixirs that transcend words, creating atmospheric soundscapes that transport the listener to serene realms. Their Spotify library is a collection of carefully curated instrumentals, where mellowness takes precedence over upbeat rhythms, and the absence of vocals becomes an essential ingredient in their musical alchemy.